Dataset from the U.S. Small Business Administration (SBA). For this case-study assignment, students assume the role of loan officer at a bank and are asked to approve or deny a loan by assessing its risk of default using logistic regression.
The U.S. SBA was founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market. Small businesses have been a primary source of job creation in the United States; therefore, fostering small business formation and growth has social benefits by creating job opportunities and reducing unemployment. One way SBA assists these small business enterprises is through a loan guarantee program which is designed to encourage banks to grant loans to small businesses. SBA acts much like an insurance provider to reduce the risk for a bank by taking on some of the risk through guaranteeing a portion of the loan. In the case that a loan goes into default, SBA then covers the amount they guaranteed.
There have been many success stories of start-ups receiving SBA loan guarantees such as FedEx and Apple Computer. However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans. The rate of default on these loans has been a source of controversy for decades. Conservative economists believe that credit markets perform efficiently without government participation. Supporters of SBA-guaranteed loans argue that the social benefits of job creation by those small businesses receiving government-guaranteed loans far outweigh the costs incurred from defaulted loans.
Since SBA loans only guarantee a portion of the entire loan balance, banks will incur some losses if a small business defaults on its SBA-guaranteed loan. Therefore, banks are still faced with a difficult choice as to whether they should grant such a loan because of the high risk of default. One way to inform their decision making is through analyzing relevant historical data such as the datasets provided here.
https://amstat.tandfonline.com/doi/full/10.1080/10691898.2018.1434342#.W4TAIi2B00o
from IPython.core.display import display, HTML
display(HTML("""<style> .container {width:96% !important;}</style>"""))
from IPython.display import IFrame
import sys
sys.path.insert(0,'../')
from utils.paths import *
import pandas as pd
import numpy as np
from plotly.offline import init_notebook_mode, iplot
import cufflinks as cf
init_notebook_mode()
cf.go_offline()
from __future__ import division
!aws s3 ls s3://eh-home/ehda-calvin/SBA_study/
# User define function
def table(no):
# there are 13 additional table
return pd.read_csv(path_SBA + 't00%02d'%no + '-10.1080%2F10691898.2018.1434342.csv')
def default_cat_table(data, cat):
default_cat = data.groupby([cat, 'default']).count().max(1).unstack()
default_cat['ALL'] = data[cat].value_counts()
default_cat['default_rate'] = (default_cat[1] / default_cat['ALL'])
default_cat = default_cat.rename(columns = {1: 'Default', 0: 'Non-default'})
return default_cat
nat = pd.read_csv(path_SBA + 'SBAnational_new.csv', sep = ';', low_memory=False,
dtype = {'Zip' : str,
'Zip3d' : str,
'Zip5d' : str,
'fips' : str})
nat.head().T
default_cat_table(nat, 'ApprovalFY').ALL.iplot(kind = 'bar', title = 'SBA cases',
yTitle = 'no. of cases', xTitle = 'Year')
default_cat_table(nat, 'ApprovalFY').loc[1990:2014].default_rate.iplot(kind = 'bar', title = 'SBA default rate',
yTitle = 'Default rate', xTitle = 'Year')
nace_group = table(3)
nace_group['sector_group'] = nace_group.Sector.str[:2]
nace_group.head()
default_cat_table2 = default_cat_table(nat, 'NAICS_group').reset_index()
default_cat_table2['NAICS_group'] = default_cat_table2['NAICS_group'].astype(str)
default_cat_table2 = default_cat_table2.merge(nace_group, how = 'left', left_on = 'NAICS_group', right_on = 'sector_group')
use_cols = ['Description', 'NAICS_group', 'default_rate', 'Default', 'ALL']
default_cat_table2[use_cols]
default_cat_table2.set_index('Description').default_rate.sort_values().iplot(kind = 'bar', title = 'Default on sectors',
yTitle = 'Default rate')
default_cat_table(nat, 'BusinessType')
nat['Term_year'] = nat.Term.apply(lambda x: round(x/12))
default_cat_table(nat, 'Term_year').head()
It appears that SBA may put term to 0 if the cases are default. Therefore it is not recommended to use this variable. However, this variable were used to create the other variable 'RealEstate' because banks only approve long loan if the company is an owner of properties.
The last word of the company name may indicate the legal type / type of business
default_cat_table(nat, 'suffix').sort_values('default_rate', ascending = False).head(10)
default_cat_table(nat, 'FranchiseCode')
default_cat_table(nat, 'UrbanRural')
Revolving line of credit
Revolving credit is a line of credit where the customer pays a commitment fee to a financial institution to borrow money, and is then allowed to use the funds when needed. It usually is used for operating purposes and the amount drawn can fluctuate each month depending on the customer's current cash flow needs. Revolving lines of credit can be taken out by corporations or individuals.
https://www.investopedia.com/terms/r/revolvingcredit.asp#ixzz5Vgppy7tP
default_cat_table(nat, 'RevLineCr')
default_cat_table(nat, 'LowDoc')
Program for small loan that require much less documents
Past record was based on if that company has previous appearence in that dataset (e.g. previous loan, when did the first loan made and default history.)
default_cat_table(nat, 'Loan_age')
default_cat_table(nat, 'default_times')
default_fips = default_cat_table(nat, 'fips')
default_fips[default_fips.ALL > 10].sort_values(['ALL', 'default_rate'], ascending = False).head()
fip_ct = nat.fips.value_counts()
fip_ct[fip_ct >= 10].shape
default_fips = nat.groupby(['fips', 'default']).count().max(1).unstack()
default_fips['ALL'] = nat.fips.value_counts()
default_fips = default_fips.reset_index()
default_fips = default_fips.rename(columns = {'index': 'fips',
0: 'Non-default',
1: 'Default',
})
default_fips['Default'] = default_fips['Default'].fillna(0)
default_fips['default_rate'] = (default_fips['Default'] / default_fips['ALL']).round(3)
default_fips_1 = default_fips[default_fips.ALL >= 10]
default_fips.head()
import plotly.figure_factory as ff
colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
"#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
"#08519c","#0b4083","#08306b"]
binpoint = [500, 1000, 3000, 5000, 10000, 20000]
cs = colorscale[0::2]
fig = ff.create_choropleth(fips = default_fips.fips, values = default_fips.ALL, binning_endpoints = binpoint,
colorscale = cs, title='SBA cases', legend_title='no. of cases', round_legend_values=True
)
iplot(fig)
binpoint = [10, 100, 1000, 3000, 5000]
cs = colorscale[0::3]
fig = ff.create_choropleth(fips = default_fips.fips, values = default_fips.ALL, binning_endpoints = binpoint,
colorscale = cs, title='SBA default cases', legend_title='no. of default cases', round_legend_values=True
)
iplot(fig)
binpoint = [0.05, 0.1, 0.2, 0.35, 0.5]
cs = colorscale[0::3]
fig = ff.create_choropleth(fips = default_fips_1.fips, values = default_fips_1.default_rate, binning_endpoints = binpoint,
colorscale = cs, title='SBA Default rate', legend_title='Default rate'
)
iplot(fig)